# Lab 12 - Simulations and hypotheses

We will start this lab by simulating some data using the [NumPy](http://www.numpy.org) package.  Pandas depends on NumPy, so it should have been installed when we installed Pandas.  However, we still need to import it.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

The smoking rate in New York City was 11.6% in 2016.  This statistic means that, in theory, if we picked 100 New Yorkers at random, we would expect about 11.6 of them to smoke.  But what happens in practice if we select 100 random New Yorkers?  We are going to simulate this scenario using code.

First we define our population:

In [None]:
population = ["Smoker","Non-smoker"]
pop_prob = [0.116,1-0.116]

This code defines our population as having two groups, `Smoker` and `Non-smoker`, with probabilities 0.116 and 1-0.116 = 0.884, respectively.
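As a quick check on these probabilities, the short sketch below works out how many people we would expect in each group in a sample of 100. The variable names `n` and `p_smoker` are only illustrative and are not used elsewhere in the lab.

```python
n = 100            # sample size used in this lab
p_smoker = 0.116   # probability of "Smoker" from pop_prob

# Expected count in each group = sample size * probability of that group
expected_smokers = n * p_smoker
expected_non_smokers = n * (1 - p_smoker)

print(round(expected_smokers, 1))      # 11.6
print(round(expected_non_smokers, 1))  # 88.4
```

These are long-run averages; any single sample of 100 people will usually land a little above or below them.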

We can generate a random sample of 100 people from our population with the code `np.random.choice(population,p=pop_prob,size=100)`.  Type and run it below.

The code has simulated 100 people and labeled each as a `Smoker` or `Non-smoker` according to the probabilities we gave.  Try re-running this code.  What happens?

Also notice that the word `array` appears at the beginning of the output.  An array is similar to a list in that it holds many values, but all of the values in an array have the same type (integers, strings, etc.).  This particular array is an `ndarray`, the array type produced by NumPy.  We have to convert the array into a Pandas Series before we can use any Pandas function on it.
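To see the difference between an array and a Series, here is a small sketch using a hand-made array of coin flips (made-up values, not the smoking simulation). It assumes nothing beyond NumPy and Pandas; the names `flips`, `flips_series`, and `flip_counts` are only for illustration.

```python
import numpy as np
import pandas as pd

# A small hand-made array (hypothetical coin flips, not the smoking data)
flips = np.array(["Heads", "Tails", "Heads", "Heads"])
print(type(flips))  # <class 'numpy.ndarray'>

# Wrapping the array in a Series makes Pandas methods such as value_counts() available
flips_series = pd.Series(flips)
flip_counts = flips_series.value_counts()

# value_counts() returns a Series indexed by label, so we can look up one label
print(flip_counts["Heads"])  # 3
```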

Next we want to count the number of smokers, which we do by:

1. saving the array into a variable
2. converting the array into a Pandas Series using `pd.Series(name_of_array)`
3. using `value_counts()` to count the number of smokers and non-smokers

Can you figure out how to write this code?

<details> <summary>Answer:</summary>
    <code>sample_array = np.random.choice(population,p=pop_prob,size=100)
pd.Series(sample_array).value_counts()</code>
</details>

To understand the variation in the number of smokers, we want to repeat the above sampling many times and make a histogram of the counts of smokers.  So we will need to extract the number of smokers from the `value_counts()` output. Modify your code above to save the `value_counts()` output as the variable `counts`. 

Then we can get the number of smokers with the code `counts["Smoker"]`.  Try it below.

In Lab 10, we sampled multiple times from our dataset and made a histogram of the mean, median, or variance of the samples.  Here we want to simulate multiple samples and make a histogram of the counts of smokers.  Try to write this code below.

<details> <summary>Hint (pseudo-code):</summary>
    <code>initialize a list for the counts
loop 200 times:
        simulate a new sample
        count the number of smokers in the sample
        append the smoker count in the latest simulation to your list
turn the list into a Pandas series and make a histogram of it</code>
</details>

<details> <summary>Answer:</summary>
    <code>
all_counts = []
for i in range(200):
    sample_array = np.random.choice(population,p=pop_prob,size=100)
    counts = pd.Series(sample_array).value_counts()
    all_counts.append(counts["Smoker"])
pd.Series(all_counts).hist()</code>
</details>

What do you notice about the histogram?  Run your code again.  How does the histogram change (or not change)?

What happens when you increase the number of simulations?
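One way to explore this question is sketched below. Note that it swaps in a different technique from the loop used in this lab: `np.random.binomial` draws the number of smokers in each sample directly, which is faster when running many simulations. The fixed seed and the particular simulation sizes are assumptions made for this sketch. The printed mean and standard deviation of the simulated counts can be compared with the values that binomial theory predicts, `n * p` and `sqrt(n * p * (1 - p))`.

```python
import numpy as np

n = 100           # people per simulated sample
p_smoker = 0.116  # smoking probability used in this lab

# Theoretical centre and spread of the smoker count (binomial distribution)
theory_mean = n * p_smoker
theory_sd = np.sqrt(n * p_smoker * (1 - p_smoker))

np.random.seed(0)  # fixed seed so the run is repeatable (an assumption for this sketch)

for n_sims in [200, 2000, 20000]:
    # Each draw is the number of smokers in one sample of n people
    smoker_counts = np.random.binomial(n, p_smoker, size=n_sims)
    print(n_sims, round(smoker_counts.mean(), 2), round(smoker_counts.std(), 2))

print("theory", round(theory_mean, 2), round(theory_sd, 2))
```

With more simulations, the simulated mean and standard deviation should settle closer to the theoretical values, and the histogram of counts should change less from run to run.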

In most high-income countries, the percentage of births that are boys is 51.2%.  Consider samples of 100 births.  How much variation is there in the number of boys born?  Generate 250 samples of size 100 and plot a histogram of the number of boys born in each sample.

### Hypotheses

A *statistical hypothesis* is a specific, testable assumption about a population that is either true or false.  Next class we will see how to formally test a hypothesis.  For this lab, we will focus on making our hypotheses specific enough that we can get some kind of answer from the data.

Suppose we want to know if taxis with more passengers take longer trips.  We can rephrase this as a statement that is either true or false:

*Taxis with more passengers take longer trips.*

Presumably we can look at trips with lots of passengers and trips with few passengers, and compare how long these trips are.  But what is the cut-off between lots and few passengers?  What exactly should we compare: the mean trip length?  The median trip length?  Something else?

We will encode our answers to the above questions in the hypothesis:

*Taxis with more than 1 passenger take longer trips on average than taxis with only 1 passenger.*

Now we can test this hypothesis.  First load the green taxi trip dataset into a dataframe.

Next, make a dataframe containing only the green taxi trips with 1 passenger.

What is the mean trip distance for these green taxi trips with only 1 passenger?

Next, make a dataframe containing only green taxi trips with 2 or more passengers.

What is the mean trip distance for the green taxi trips with 2 or more passengers?  

Is there any difference between the mean length of taxi trips with 1 passenger and taxi trips with 2 or more passengers?  Do you think this difference is significant?
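The filter-and-compare pattern used in these steps can be sketched on a tiny made-up dataframe. The column names `Passenger_count` and `Trip_distance` are assumptions here, so check them against the columns in your own file, and the numbers are invented purely for illustration.

```python
import pandas as pd

# Hypothetical mini dataset; column names are assumed and the values are made up
trips = pd.DataFrame({
    "Passenger_count": [1, 1, 2, 3, 1, 5],
    "Trip_distance":   [1.0, 3.0, 2.0, 4.0, 2.0, 6.0],
})

# Boolean filters split the trips into the two groups named in the hypothesis
single = trips[trips["Passenger_count"] == 1]
multiple = trips[trips["Passenger_count"] >= 2]

# Compare the mean trip distance of the two groups
mean_single = single["Trip_distance"].mean()
mean_multiple = multiple["Trip_distance"].mean()
print(mean_single, mean_multiple, mean_multiple - mean_single)  # 2.0 4.0 2.0
```

A difference in sample means on its own does not tell us whether the difference is significant; that depends on how much the means would vary from sample to sample.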

#### Challenges:
- Is there any difference between the mean length of taxi trips that begin at JFK or Newark airports (RateCodeID is 2 or 3) and the other taxi trips?
- Is there any difference between the median length of taxi trips paid by credit card (Payment_type is 1) and trips paid by cash (Payment_type is 2)?
- Is there any difference in the mean number of passengers in trips taken at night from those taken during the day?
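
For the last challenge, one possible starting point is to pull the hour out of the pickup time. The sketch below uses a tiny made-up dataframe; the column name `lpep_pickup_datetime` and the choice of "night" as 8 pm to 6 am are both assumptions, so adjust them to match your dataset and your own hypothesis.

```python
import pandas as pd

# Hypothetical mini dataset; the column names and values are assumptions
trips = pd.DataFrame({
    "lpep_pickup_datetime": [
        "2016-01-01 02:15:00",
        "2016-01-01 13:40:00",
        "2016-01-01 23:05:00",
        "2016-01-02 09:30:00",
    ],
    "Passenger_count": [1, 2, 3, 1],
})

# Convert the text timestamps to datetimes and extract the hour (0 to 23)
pickup_hour = pd.to_datetime(trips["lpep_pickup_datetime"]).dt.hour

# Assumed definition of night: 20:00 through 05:59
is_night = (pickup_hour >= 20) | (pickup_hour < 6)

night_mean = trips.loc[is_night, "Passenger_count"].mean()
day_mean = trips.loc[~is_night, "Passenger_count"].mean()
print(night_mean, day_mean)  # 2.0 1.5
```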